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Abstract — We address the sparse signal recovery problem 
in the context of multiple measurement vectors (MMV) when 
elements in each nonzero row of the solution matrix are tem- 
porally correlated. Existing algorithms do not consider such 
temporal correlation and thus their performance degrades sig- 
nificantly with the correlation. In this work, we propose a 
block sparse Bayesian learning framework which models the 
temporal correlation. We derive two sparse Bayesian learning 
(SBL) algorithms, which have superior recovery performance 
compared to existing algorithms, especially in the presence of 
high temporal correlation. Furthermore, our algorithms are 
better at handling highly underdetermined problems and require 
less row-sparsity on the solution matrix. We also provide analysis 
of the global and local minima of their cost function, and show 
that the SBL cost function has the very desirable property that 
the global minimum is at the sparsest solution to the MMV 
problem. Extensive experiments also provide some interesting 
results that motivate future theoretical research on the MMV 
model. 

Index Terms — Sparse Signal Recovery, Compressed Sensing, 
Sparse Bayesian Learning (SBL), Multiple Measurement Vectors 
(MMV), Temporal Correlation 

I. Introduction 

Sparse signal recovery, or compressed sensing, is an emerg- 
ing field in signal processing [l]-[4]. The basic mathematical 
model is 

y = *x + v, (1) 

where e R NxAI (N <C M) is a known dictionary matrix, 
and any N columns of <& are linearly independent (i.e. satisfies 
the Unique Representation Property (URP) condition [5]), 
y 6 R JVxl is an available measurement vector, and v is 
an unknown noise vector. The task is to estimate the source 
vector x. To ensure a unique global solution, the number 
of nonzero entries in x has to be less than a threshold [5], 
[6]. This single measurement vector (SMV) model (HJ has a 
wide range of applications, such as electroencephalography 
(EEG)/Magnetoencephalography (MEG) source localization 
[7], direction-of-arrival (DOA) estimation [8], radar detection 
[9], and magnetic resonance imaging (MRI) [10]. 

Motivated by many applications such as EEG/MEG source 
localization and DOA estimation, where a sequence of mea- 
surement vectors are available, the basic model ([T} has been 
extended to the multiple measurement vector (MMV) model 
in [11], [12], given by 



Y = #X + V, 



(2) 
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where Y = [Y.i,-- - ,Y. L ] € R NxL is an available mea- 
surement matrix consisting of L measurement vectors, X = 
[X.i,-- - , X.i] <E R Mxi is an unknown source matrix (or 
called a solution matrix) with each row representing a possible 
sourceQ, and V is an unknown noise matrix. A key assumption 
in the MMV model is that the support (i.e. indexes of nonzero 
entries) of every column in X is identical (referred as the 
common sparsity assumption in literature [12]). In addition, 
similar to the constraint in the SMV model, the number of 
nonzero rows in X has to be below a threshold to ensure a 
unique and global solution [12]. This leads to the fact that X 
has a small number of nonzero rows. 

It has been shown that compared to the SMV case, the 
successful recovery rate can be greatly improved using multi- 
ple measurement vectors [12]— [15]. For example, Cotter and 
Rao [12] showed that by taking advantage of the MMV 
formulation, one can relax the upper bound in the uniqueness 
condition for the solution. Tang, Eldar and their colleagues 
[14], [16] showed that under certain mild assumptions the 
recovery rate increases exponentially with the number of 
measurement vectors L. Jin and Rao [15], [17] analyzed the 
benefits of increasing L by relating the MMV model to the 
capacity regions of MIMO communication channels. All these 
theoretical results reveal the advantages of the MMV model 
and support increasing L for better recovery performance. 

However, under the common sparsity assumption we cannot 
obtain many measurement vectors in practical applications. 
The main reason is that the sparsity profile of practical signals 
is (slowly) time-varying, so the common sparsity assumption 
is valid for only a small L in the MMV model. For example, in 
EEG/MEG source localization there is considerable evidence 
[18] that a given pattern of dipole-source distributions @ 
may only exist for 10-20 ms. Since the EEG/MEG sampling 
frequency is generally 250 Hz, a dipole-source pattern may 
only exist through 5 snapshots (i.e. in the MMV model 
L = 5). In DOA estimation [19], directions of targets 
are continuously changing, and thus the source vectors that 
satisfy the common sparsity assumption are few. Of course, 
one can increase the measurement vector number at the cost 
of increasing the source number, but a larger source number 
can result in degraded recovery performance. 

Thanks to numerous algorithms for the basic SMV model, 

'Here for convenience we call each row in X a source. The term is often 
used in application-oriented literature. Throughout the work, the i-th source 
is denoted by X;.. 

2 In this application the set of indexes of nonzero rows in X is called a 
pattern of dipole-source distribution. 

3 In this application the index of a nonzero row in X indicates a direction. 
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most MMV algorithms are obtained by straightforward 
extension of the SMV algorithms; for example, calculating 
the li norm of each row of X, forming a vector, and then 
imposing the sparsity constraint on the vector. These algo- 
rithms can be roughly divided into greedy algorithms [20], 
[21], algorithms based on mixed norm optimization [22]- 
[24], iterative reweighted algorithms [12], [25], and Bayesian 
algorithms [26], [27], 

Among the MMV algorithms, Bayesian algorithms have re- 
ceived much attention recently since they generally achieve the 
best recovery performance. Sparse Bayesian learning (SBL) 
is one important family of Bayesian algorithms. It was first 
proposed by Tipping [28], [29], and then was greatly enriched 
and extended by many researchers [25]-[27], [30]-[36]. For 
example, Wipf and Rao first introduced SBL to sparse signal 
recovery [30] for the SMV model, and later extended it to 
the MMV model, deriving the MSBL algorithm [26]. One 
attraction of SBL/MSBL is that, different from the popular 
l\ minimization based algorithms [37], [38], whose global 
minimum is generally not the sparsest solution [30], [39], the 
global minima of SBL/MSBL are always the sparsest one. 
In addition, SBL/MSBL have much fewer local minima than 
some classic algorithms, such as the FOCUSS family [5], [12]. 

Motivated by applications where signals and other types 
of data often contain some kind of structures, many algo- 
rithms have been proposed [13], [40]-[42], which exploit 
special structures in the source matrix X. However, most 
of these works focus on exploiting spatial structures (i.e. 
the dependency relationship among different sources) and 
completely ignore temporal structures. Besides, for tractability 
purposes, almost all the existing MMV algorithms (and theo- 
retical analysis) assume that the sources are independent and 
identically distributed (i.i.d.) processes. This contradicts the 
real- world scenarios, since a practical source often has rich 
temporal structures. For example, the waveform smoothness 
of biomedical signals has been exploited in signal processing 
for several decades. Besides, due to high sampling frequency, 
amplitudes of successive samplings of a source are strongly 
correlated. Recently, Zdunek and Cichocki [43] proposed 
the SOB-MFOCUSS algorithm, which exploits the waveform 
smoothness via a pre-defined smoothness matrix. However, 
the design of the smoothness matrix is completely subjective 
and not data-adaptive. In fact, in the task of sparse signal 
recovery, learning temporal structures of a source is a difficult 
problem. Generally, such structures are learned via a training 
dataset (which often contains sufficient data without noise for 
robust statistical inference) [44], [45]. Although effective for 
some specific signals, this method is limited. Having noticed 
that the temporal structures strongly affect the performance of 
existing algorithms, in [31] we derived the AR-SBL algorithm, 
which models each source as a first-order autoregressive (AR) 
process and learns AR coefficients from the data per se. 
Although the algorithm has superior performance compared 
to MMV algorithms in the presence of temporal correlation, 
it is slow, which limits its applications. As such, there is a 

4 For convenience, algorithms for the MMV model are called MMV 
algorithms; algorithms for the SMV model are called SMV algorithms. 



need for efficient algorithms that can deal more effectively 
with temporal correlation. 

In this work, we present a block sparse Bayesian learning 
(bSBL) framework, which transforms the MMV model (0 to 
a SMV model. This framework allows us to easily model the 
temporal correlation of sources. Based on it, we derive an 
algorithm, called T-SBL, which is very effective but is slow 
due to its operation in a higher dimensional parameter space 
resulting from the MMV-to-SMV transformation. Thus, we 
make some approximations and derive a fast version, called T- 
MSBL, which operates in the original parameter space. Similar 
to T-SBL, T-MSBL is also effective but has much lower com- 
putational complexity. Interestingly, when compared to MSBL, 
the only change of T-MSBL is the replacement of ||Xj. ||| with 
the Mahalanobis distance measure, i.e. Xi.B _1 X^, where B 
is a positive definite matrix estimated from data and can be 
partially interpreted as a covariance matrix. We analyze the 
global minimum and the local minima of the two algorithms' 
cost function. One of the key results is that in the noiseless 
case the global minimum is at the sparsest solution. Extensive 
experiments not only show the superiority of the proposed 
algorithms, but also provide some interesting (even counter- 
intuitive) phenomena that may motivate future theoretical 
study. 

The rest of the work is organized as follows. In Section 
Ull we present the bSBL framework. In Section [HI] we derive 
the T-SBL algorithm. Its fast version, the T-MSBL algorithm, 
is derived in Section [IV] Section [V] provides theoretical 
analysis on the algorithms. Experimental results are presented 
in Section [VI] Finally, discussions and conclusions are drawn 
in the last two sections. 

We introduce the notations used in this paper: 

• 1 1 3d ||i, ||x||2, ||A||jr denote the t\ norm of the vector x, 
the £2 norm of x, and the Frobenius norm of the matrix 
A, respectively. ||A||o and ||x||o denote the number of 
nonzero rows in the matrix A and the number of nonzero 
elements in the vector x, respectively; 

• Bold symbols are reserved for vectors and matrices. 
Particularly, 1l denotes the identity matrix with size 
L x L. When the dimension is evident from the context, 
for simplicity, we just use I; 

• diag{ai, • • • , a,M } denotes a diagonal matrix with 
principal diagonal elements being aiv,ajvf in 
turn; if Ai, • • • , Am are square matrices, then 
diag{Ai, • • • , Am} denotes a block diagonal matrix 
with principal diagonal blocks being Ai,--- , Am in 
turn; 

• For a matrix A, A$. denotes the i-th row, A.j denotes 
the i-th column, and A,* f j denotes the element that lies 
in the z-th row and the j-th column; 

« A ® B represents the Kronecker product of the two 
matrices A and B. vec(A) denotes the vectorization of 
the matrix A formed by stacking its columns into a single 
column vector. Tr(A) denotes the trace of A. A T denotes 
the transpose of A. 
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II. Block Sparse Bayesian Learning Framework The prior for x is given by 



Most existing works do not deal with the temporal corre- 
lation of sources. For many non-Bayesian algorithms, incor- 
porating temporal correlation is not easy due to the lack of 
a well defined methodology to modify the diversity measures 
employed in the optimization procedure. For example, it is 
not clear how to best incorporate correlation in i\ norm based 
methods. For this reason, we adopt a probabilistic approach to 
incorporate correlation structure. Particularly, we have found it 
convenient to incorporate correlation into the sparse Bayesian 
learning (SBL) methodology. 

Initially, SBL was proposed for regression and classification 
in machine learning [28]. Then Wipf and Rao [30] applied it to 
the SMV model (HJ for sparse signal recovery. The idea is to 
find the posterior probability p(x|y; 9) via the Bayesian rule, 
where indicates the set of all the hyperparameters. Given the 
hyperparameters, the solution x is given by the Maximum-A- 
Posterior (MAP) estimate. The hyperparameters are estimated 
from data by marginalizing over x and then performing 
evidence maximization or Type-II Maximum Likelihood [28]. 
To solve the MMV problem (O, Wipf and Rao [26] proposed 
the MSBL algorithm, which implicitly applies the £2 norm on 
each source Xj.. One drawback of this algorithm is that the 
temporal correlation of sources is not exploited to improve 
performance. 

To exploit the temporal correlation, we propose another SBL 
framework, called the block sparse Bayesian learning (bSBL) 
framework. In this framework, the MMV model is transformed 
to a block SMV model. In this way, we can easily model the 
temporal correlation of sources and derive new algorithms. 

First, we assume all the sources Xj. (W) are mutually 
independent, and the density of each Xj. is Gaussian, given 
by 



p(x;H,Bi,Vi) ~A/;(0,Eo), 



p(Xi. ; 7i , BO ~^/"(0,7 i B i 



i = ,M 



where 74 is a nonnegative hyperparameter controlling the row 
sparsity of X as in the basic SBL [26], [28], [30]. When 7, = 
0, the associated X,. becomes zeros. B; is a positive definite 
matrix that captures the correlation structure of X,. and needs 
to be estimated. 

By letting y = vec(Y T ) € R NLxl , D = * <g> I L , x = 
vec(X T ) e R MLxl , v = vec(V T ), we can transform the 
MMV model to the block SMV model 



y = Dx + v. 



(3) 



To elaborate the block sparsity model (|3), we rewrite it as 

y = [MIl,-- ,<t>M®W[xTr-- > x M] T + v = E i =i(^^ 
Il)x; + v, where <p i is the i-th column in and Xj 6 R Lxl 
is the z-th block in x and Xj = X^. K nonzero rows in X 
means K nonzero blocks in x. Thus x is block-sparse. 

Assume elements in the noise vector v are independent and 
each has a Gaussian distribution, i.e. p(vi) ~ M(0, A), where 
Vi is the i-th element in v and A is the variance. For the block 
model (O, the Gaussian likelihood is 

p(y|x; A) ~ A/" tf | x (Dx, AI). 



where S is 



So = 



7iBi 



7a/ B m 



(4) 



Using the Bayes rule we obtain the posterior density of x, 
which is also Gaussian, 

p(x|y;A,7i,Bi,Vi) = AT x (fi x ,Tl x ) 



with the mean 



H x = — SjjD y 



(5) 



and the covariance matrix 



1. 



(So 1 + -D T D) 



= S - S D T (AI + DS D T ) ^Sq. 



(6) 



So given all the hyperparameters A,7i,Bj,Vi, the MAP 
estimate of x is given by: 

x*4 Mx = (AS^+D^Dr^y 

= S D T (AI + DS D T )" 1 y (7) 



where the last equation follows the matrix identity (I + 
AB)- : A = A(I + BA)"\ and S is the block diagonal 
matrix given by (0]i with many diagonal block matrices being 
zeros. Clearly, the block sparsity of x* is controlled by the 
7i's in S : during the learning procedure, when 7^. = 0, the 
associated fc-th block in x* becomes zeros, and the associated 
dictionary vectors (fi k <g> 1^ are pruned out 0. 

To estimate the hyperparameters we can use evidence max- 
imization or Type-II maximum likelihood [28]. This involves 
marginalizing over the weights x and then performing maxi- 
mum likelihood estimation. We refer to the whole framework 
including the solution (O and the hyperparameter estimation 
as the block sparse Bayesian learning (bSBL) framework. Note 
that in contrast to the original SBL framework, the bSBL 
framework models the temporal structures of sources in the 
prior density via the matrices Bj (i = 1, • ■ • , M). Different 
ways to learn the matrices result in different algorithms. 
We will discuss the learning of these matrices and other 
hyperparameters in the following sections. 

III. Estimation of Hyperparameters 

Before estimating the hyperparameters, we note that as- 
signing a different matrix Bj to each source X^. will result 
in overfitting [46], [47] due to limited data and too many 
parameters. To avoid the overfitting, we consider using one 
positive definite matrix B to model all the source covariance 
matrices up to a scalar @. Thus Eq.Q becomes So = T ® B 

5 In practice, we judge whether 7^ is less than a small threshold, e.g. 10 — 5 . 
If it is, then the associated dictionary vectors are pruned out from the learning 
procedure and the associated block in x is set to zeros. 

6 Note that the covariance matrix in the density of X^. is 7jB^. 
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with r = diag(7i, • ■ ■ ,7a/)- Although this strategy is equiv- 
alent to assuming all the sources have the same correlation 
structure, it leads to very good results even if all the sources 
have totally different correlation structures (see Section IVII ). 
More importantly, this constraint does not destroy the global 
minimum property (i.e. the global unique solution is the 
sparsest solution) of our algorithms, as confirmed by Theorem 
Q] in Section |V] 

To find the hyperparameters = {71, • • • , jm, B, A}, we 
employ the Expectation-Maximization (EM) method to maxi- 
mize p(y; 0). This is equivalent to minimizing — logp(y; 0), 
yielding the effective cost function: 

£(6)=y T S; 1 y + log|S,|, (8) 

where T, y = AI + DSqD t . The EM formulation proceeds by 
treating x as hidden variables and then maximizing: 

Q(9) = E xly . e ( id)[\ogp(y,x;Q)] 



E 



[logp(y|x; A)] 



+ E x\y,e(°W [l°gp( x ;7i, ■ ■ ■ >7m,B)] (9) 

where 0( old ) denotes the estimated hyperparameters in the 
previous iteration. 

To estimate 7 = [71, • • • ,7a/] and B, we notice that the 
first term in (0 is unrelated to 7 and B. So, the Q function 
(0 can be simplified to: 

Q(7,B) = ^ hy:e(old) [logp(x;7,B)]. 

It can be shown thaj^l 



logp(x;7,B) oc -- log (|T| 
which results in 
Q( 7 ,B) « 



IB 



Tm-i 



-x J (r 

2 V 



® B" 



)x. 



L 
2 



M 



log (|r|) - — log (|B|) 



(r- 1 ®B- 1 )(s x + Ma;/ i^ 



,(10) 



where fi x and H x are evaluated according to (0 and (0, given 
the estimated hyperparameters 0( old ). 

The derivative of (TlOb with respect to 7; (i = 1, • ■ ■ , M) is 
given by 

dQ L 1 



rTr 



B 



i \T\ 



d lq ~ 2 7i 27? 

where we define (using the MATLAB notations) 

*4 - Vxitf -1)^+1 : %L) 
S* =Sa:((i-l)i + l : iL, (i - 1)L + 1 : iL) 

So the learning rule for 7, {i = 1, • • ■ , M) is given by 
TrfB-^ + ^O^H] 



(11) 



7i <~ 



M (12) 



On the other hand, the gradient of (fTOb over B is given by 



The <x notation is used to indicate that terms that do not contribute to the 
subsequent optimization of the parameters have been dropped. This convention 
will be followed through out the paper. 



Thus we obtain the learning rule for B: 

/4(/4) T 



B 



M ^ 



7i 



(13) 



To estimate A, the Q function (0 can be simplified to 



Q(A) 



E 



^.y-Qlold) 



logp(y|x; A)] 



K —g-logA- — ^^.Qtotd) []|y - Dx||l] 



— l0gA -2A 



|y-DMxlla 



+^| J/ .e(o 1 d)[||D(x-/iJ||^ 



= log A 

2 6 2A 

= log A 

2 6 2A 



ly-D^JII + TrrS^D) 



|y-D/i a ||a 



+ATr(S ;c (S; 1 - So 1 )) 

y-D/i^lla 
+A[Mi-Tr(S x So 1 )] n 



= log A 

2 6 2A 



(14) 



(15) 



where ( fl4l i follows from the first equation in (0, and A denotes 
the estimated A in the previous iteration. The A learning rule 
is obtained by setting the derivative of ( fT3T > over A to zero, 
leading to 



A 



AfATX-TrOMQ- 1 )] 



NL 



(16) 



where the A on the right-hand side is the A in (TT3T >- There are 
some challenges to estimate A in SMV models. This, however, 
is alleviated in MMV models when considering temporal 
correlation. We elaborate on this next. 

In the SBL framework (either for the SMV model or for the 
MMV model), many learning rules for A have been derived 
[26], [28], [30], [34]. However, in noisy environments some of 
the learning rules probably cannot provide an optimal A, thus 
leading to degraded performance. For the basic SBL/MSBL 
algorithms, Wipf et al [26] pointed out that the reason is that 
A and appropriate N nonzero hyperparameters 7$ make an 



identical contribution to the covariance S, 



ai + *r$ J 



in the cost functions of SBL/MSBL. To explain this, they 
gave an example: let a dictionary matrix <&' = [$ ,I], where 
e R NxM and * e R Nx ( M ~ N \ Then the A as well as 
the TV hyperparameters {tm-jv+Ij • ■ ■ ,7a/} associated with 
the columns of the identity matrix in $' are not identifiable, 
because 

S y = AI + $T*' T 

= AI + [* ,I]diag{7i, • • • ,7 A /}[*o,I] T 
= AI + * diag{7i, • • • ,jm-n] 
+diag{7A/_jv+i, • • • ,7a/} 



r}*rj 



indicating a nonzero value of A and appropriate values of the 
jV nonzero hyperparameters, i.e. jm-n+i, • • * > 7a/, can make 
an identical contribution to the covariance matrix S„. This 
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problem can be worse when the noise covariance matrix is 
diag(Ai, • ■ • , Xn) with arbitrary nonzero Xi, instead of AI. 

However, our learning rule (H&t does not have such ambi- 
guity problem. To see this, we now examine the covariance 
matrix H y in our cost function {8). Noting that D = <&' (8 I, 
we have 

S y = AI + DS D T 

= AI+(*'®I)(diag{ 7l ,--- , 7Af }®B)(*'®I) T 
= AI+ [* ®I,I®I](diag{7i.-- - ,7m} ®B) 
•[*o®I,I®I] T 

= AI + (* diag{ 7 i, • • • , 7m-w}*o) ® B 

+diag{j M -N+i, ■ ■ • , 7m} ® B. 

Obviously, since B is not an identity matrix 0, A and 
{iM-N+ii ' ' ' iIm} cannot identically contribute to H y . 

The SBL algorithm using the learning rules (|6), (|7), (TlZt . 
O and dT6j is denoted by J-SBL. 

IV. An Efficient Algorithm Processing in the 
Original Problem Space 

The proposed T-SBL algorithm has excellent performance 
in terms of recovery performance (see Section |Vl}. But it is not 
fast because it learns the parameters in a higher dimensional 
space instead of the original problem space 0. For example, 
the dictionary matrix is of the size NL x ML in the bSBL 
framework, while it is only of the size TV x M in the original 
MMV model. Interestingly, the MSBL developed for i.i.d. 
sources has complexity 0(N 2 M) and does not exhibit this 
drawback [26]. Motivated by this, we make a reasonable 
approximation and back-map T-SBL to the original space 

For convenience, we first list the MSBL algorithm derived 
in [26]: 



X 

7i 



IX* 



-1 


(17) 




(18) 


a, Vi 


(19) 



r* T (Ai 

i 

L ' 

An important observation is the lower dimension of the matrix 
operations involved in this algorithm. We attempt to achieve 
similar complexity for the T-SBL algorithm by adopting the 
following approximation: 



(AI W l+DS D t )" 



= (AIjVL 

« (AIjv ^ 



f (*r$ T )<g)B) 



(20) 



which is exact when A = 0orB = Ii. For high signal-to- 
noise ratio (SNR) or low correlation the approximation is quite 

8 Note that even all the sources are i.i.d. processes, the estimated B in 
practice is not an exact identity matrix. 

'T-SBL can be directly used to solve the block sparsity models [13], [22], 
[41]. In this case, the algorithm directly performs in the original parameter 
space and thus it is not slow (compared to the speed of some other algorithms 
for the block sparsity models). 

10 By back-mapping, we mean we use some approximation to simplify the 
algorithm such that the simplified version directly operates in the parameter 
space of the original MMV model. 



reasonable. But our experiments will show that our algorithm 
adopting this approximation performs quite well over a broader 
range of conditions (see Section fVB . 

Now we use the approximation to simplify the 7, learning 
rule (1121) , First, we consider the following term in (TT2t : 



1 



TrfB" 1 !]! 



1, 
L 



-Tr 



iJ-L 
T\-l 



7 f(</>f ®Il)(AI 



NL 



DS D ) {4>i ® It) • B 
>f(AI 



(21) 



2_ 

L 



N 

B 1 )] 



"(Vf(ALv 



n -7 



2 0f(Ai JV + *r* T ) V, 



(22) 



where d2"TT i follows the second equation in ©, and H x is given 
in ( PT7l ). Using the same approximation (l20l i. the fi x in ([T2b 
can be expressed as 

n x « (r®B)(* T ®i) 

•[(AI + *r* T ) _1 ®B- 1 ]vcc(Y T ) (23) 
= [r* T (AI + *r* T ) _1 ] (8) I • vec(Y T ) 

= vcc(y t (ai + *r* T ) _1 *r) 

= vcc(X T ) (24) 



where (l23l follows (0 and the approximation (l20l l. and X 
is given in ( [T8l . Therefore, based on (l22l and (l24l . we can 
transform the 7* learning rule ([L2l to the following form: 



1 



To simplify the B learning rule ([TBI , we note that 



(25) 



= So - SoD T (AI + DS D T )" 1 DSo 

= r®B- (r®B)(* T ®I)(AI + DS D T )- 1 

•(*®i)(r®B) 
« r®B- [(r$ T )®B][(Ai + *r* T )- 1 ®B- 1 ] 

•[(#r)®B] (26) 

= (r-r* T (Ai + *r* 7 y i $r) ®b 

= 3* <g> B, 



where ( I26l l uses the approximation (120V Using the definition 
dTTT >. we have = (H K )i;B. Therefore, the learning rule 
([T3l becomes: 



B 



M ^ 



7* 



B 



M 

- V 

M ^ 



7i 



(27) 



i=l ' i=l 
From the learning rule above, we can directly construct a fixed- 
point learning rule, given by 

1 ^4 x?%. 



B 



M(l - p) 



7i 
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where p = jjJ2i=i"fi 1 {^x)u- To increase the robustness, 
however, we suggest using the rule below: 



B 
B 



M 

E 



7i 



i=i 
B/||B|b 



(28) 
(29) 



where ( 1291 is to remove the ambiguity between B and 7$ 
(V«). This learning rule performs well in high SNR cases and 
noiseless cases HJ. However, in low or medium SNR cases 
(e.g. SNR < 20dB) it is not robust due to errors from the 
estimated 7$ and X^.. For these cases, we suggest adding a 
regularization item in B, namely, 



B 



M 

E 



x'x. 



(30) 



where 77 is a positive scalar. This regularized form (l30l ensures 
that B is positive definite. 

Similarly, we simplify the A learning rule (fT6] l as follows: 



|y-DMjll 



AfML-Tr^Eo 1 )] 



NL 

0^111 + Alt(S D T Sr 1 D) 



AL 



(31) 



J-|| 
.((AI 

al" 



Y-*X||^4 

+ *r$ T r 1 



*x[[3 ? - 



A 



-Tr 



NL 

SB" 1 )^ 

A 
A 



(r®B)(* 5 
i) 



i) 



(32) 

Tr[*r* T (Ai + *r* T )- 1 ] 

(33) 



where in OTb we use the first equation in ©, and in (1321 we 
use the approximation ( f2Qb . Empirically, we find that setting 
the off-diagonal elements of <&r<I> T to zeros further improves 
the robustness of the A learning rule in strongly noisy cases. 
In our experiments we will use the modified version when 
SNR < 20dB. 

We denote the algorithm using the learning rules ( flTt , (fTS), 
dBJ, d28l), (|29l) (or GJB), and 01 by T-MSBL (the name 
emphasizes the algorithm is a temporal extension of MSBL). 
Note that T-MSBL cannot be derived by modifying the cost 
function of MSBL. 

Comparing the ji learning rule of T-MSBL (Eq.(l25ll) with 
the one of MSBL (Eq.fO), we observe that the only change is 
the replacement of ||Xj. ||| with Xj.B _1 X?], which incorpo- 
rates the temporal correlation of the sources. Hence, T-MSBL 
has only extra computational load for calculating the matrix 
B and the item X i .B~ 1 X?' Since the matrix B has a 

1 1 Note that in )28t when the number of distinct nonzero rows in X 
is smaller than the number of measurement vectors, the matrix B is not 
invertible. But this case is rarely encountered in practical problems, since 
in practice the number of measurement vectors is generally small, as we 
explained previously. The presence of noise in practical problems also requires 
the use of the regularized form i30\ , which is always invertible. 

12 Here we do not compare the A learning rules of both algorithms, since 
in some cases one can feed the algorithms with suitable fixed values of A, 
instead of using the A learning rales. However, the computational load of the 
simplified A learning rale of T-MSBL is also not high. 



small size and is positive definite and symmetric, the extra 
computational load is low. 

Note that X;.B -1 X^ is the quadratic Mahalanobis distance 
between Xj. and its mean (a vector of zeros). In the following 
section we will get more insight into this change. 

V. Analysis of Global Minimum and Local Minima 

Since our bSBL framework generalizes the basic SBL 
framework, many proofs below are rooted in the theoretic work 
on the basic SBL [30]. However, some essential modifications 
are necessary in order to adapt the results to the bSBL 
model. Due to the equivalence of the original MMV model 
(H and the transformed block sparsity model (f3), in the 
following discussions we use (H or <[3j interchangeably and 
per convenience. 

Throughout our analysis, the true source matrix is denoted 
by Xgen, which is the sparsest solution among all the possible 
solutions. The number of nonzero rows in X gcn is denoted by 
Kq. We assume that X gcn is full column-rank, the dictionary 
matrix $ satisfies the URP condition [5], and the matrix B 
(or Bj,Vi) and its estimate are positive definite. 

A. Analysis of the Global Minimum 

We have the following result on the global minimum of the 
cost function dHJ [3 

Theorem 1: In the limit as A —> 0, assuming Kq < (A + 
L)/2 , for the cost function (d the unique global minimum 
7 — [71 5 " ' 7 7m] produces a source estimate X that equals to 
Xgcn irrespective of the estimated B^ , Vi, where X is obtained 
from vec(X T ) = x and x is computed using Eq.(|7]i. 

The proof is given in the Appendix. 

If we change the condition K < (A + L) /2 to K < A, 
then we have the conclusion that the source estimate X equals 
to X gcn with probability 1, irrespective of B^, Vi. This is due 
to the result in [48] that if A'o < A the above conclusion still 
holds for all X except on a set with zero measure. 

Note that 7 is a function of the estimated B; (Vi). However, 
the theorem implies that even when the estimated B^ is 
different from the true B^, the estimated sources are the true 
sources at the global minimum of the cost function. As a 
reminder, in deriving our algorithms, we assumed B^ = B (Vi) 
to avoid overfitting. Theorem [TJ ensures our algorithms using 
this strategy also have the global minimum property. Also, 
the theorem explains why MSBL has the ability to exactly 
recover true sources in noiseless cases even when sources are 
temporally correlated. But we hasten to add that this does 
not mean B is not important for the performance of the 
algorithms. For instance, MSBL is more frequently attracted 
to local minima than our proposed algorithms, as experiments 
show later. 

B. Analysis of the Local Minima 

In this subsection we discuss the local minimum property of 
the cost function C in (H with respect to 7 ^ [71, • • • , 7a/], 

13 For convenience, in this theorem we consider the cost function with So 
given by @, i.e. the one before we use our strategy to avoid the overfitting. 
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in which So = T ® B for fixed B. Before presenting our 
results, we provide two lemmas needed to prove the results. 

Lemma 1: log S. y = log |AI + DSoD T is concave with 
respect to 7. 

This can be shown using the composition property of 
concave functions [49], 

Lemma 2: y T S y : y equals a constant C when 7 satisfies 
the linear constraints 

A • (7 <g> li) = b (34) 

with 

b = y-Au (35) 

A 4 (# ® B)diag(D T u) (36) 

where A is full row rank, 1^ is an L x 1 vector of ones, and 
u is any fixed vector such that y T u = C. 

The proof is given in the Appendix. According to the 
definition of basic feasible solution (BFS) [50], we know 
that if 7 satisfies Equation ( |34l , then it is a BFS to (f34-b if 
IMIo < NL, or a degenerate BFS to (04} if ||7|| < NL. 
Now we give the following result: 

Theorem 2: Every local minimum of the cost function C 
with respect to 7 is achieved at a solution with I7H0 < NL, 
regardless of the values of A and B. 

The proof is given in the Appendix. 

Admittedly, the bound on the local minima H7H0 is loose, 
and it is not meaningful when NL > M, However, we 
empirically found that H7II0 actually is very smaller than NL. 

Now, we calculate the local minima of the cost function 
C. The result can provide some insights to the role of B. 
Particularly, we are more interested in the local minima 
satisfying ||7||o < N, since the global minimum satisfies 
H7II0 < N. For these local minima, we have the following 
result: 

Lemma 3: In noiseless cases (A — > 0), for every local 
minimum of C that satisfies H7H0 = K < N, its i-th nonzero 
element is given by 7m = -^Xj.B _1 X^ (i = 1, • • • , K), 
where Xj. is the i-th nonzero row of X and X is the basic 
feasible solution to Y = <J?X. 

The proof is given in the Appendix. 

From this lemma we immediately have the closed form of 
the global minimum. 

B actually plays a role of temporally whitening the sources 
during the learning of 7. To see this, assume all the sources 
have the same correlation structure, i.e. share the same matrix 
B. Let Z t . = Xj-B" 1 / 2 . From Lemma [3] at the global 
minimum we have 7^ = -^Z^.Z^ (i = 1, • • ■ , Ko). On the 
other hand, in the case of i.i.d. sources, at the global minimum 
we have 7^) = J-X.;.X^ (i = 1,- • • ,Kq). So the results for 
the two cases have the same form. Since E{ZjZi.} = jil, we 
can see in the learning of 7, B plays the role of whitening each 
source. This gives us a motivation to modify most state-of-the- 
art iterative re weighted algorithms by temporally whitening the 
estimated sources during iterations [32], [33]. 

VI. Computer Experiments 

Extensive computer experiments have been conducted and 
a few representative and informative results are presented. 



All the experiments consisted of 1000 independent trials. 
In each trial a dictionary matrix *& G M. NxM was created 
with columns uniformly drawn from the surface of a unit 
hypersphere (except the experiment in Section IVI-GK as 
advocated by Donoho et al [51]. And the source matrix 
Xgon £ R Mxi was randomly generated with K nonzero 
rows (i.e. sources). In each trial the indexes of the sources 
were randomly chosen. In most experiments (except to the 
experiment in Section IVI-Dt each source was generated as 
AR(1) process. Thus the AR coefficient of the i-th source, 
denoted by Pi, indicated its temporal correlation. As done in 
[20], [24], for noiseless cases, the £2 norm of each source was 
rescaled to be uniformly distributed between 1/3 and 1; for 
noisy cases, rescaled to be unit norm. Finally, the measurement 
matrix Y was constructed by Y = <J>X gen +V where V was a 
zero-mean homoscedastic Gaussian noise matrix with variance 
adjusted to have a desired value of SNR, which is defined by 
SNR(dB) 4 201og 10 (||*X gcn ||^/||V||^). 

We used two performance measures. One was the Failure 
Rate defined in [26], which indicated the percentage of failed 
trials in the total trials. In noiseless cases, a failed trial was 
recognized if the indexes of estimated sources were not the 
same as the true indexes. In noisy cases, since any algorithm 
cannot recover X gon exactly in these cases, a failed trial was 
recognized if the indexes of estimated sources with the K 
largest £2 norms were not the same as the true indexes. In 
noisy cases, the mean square error (MSE) was also used as 
a performance measure, defined by ||X — Xgonllfr/ll^gonll 2 ^ 
where X was the estimated source matrix. 

In our experiments we compared our T-SBL and T-MSBL 
with the following algorithms: 

. MSBL, proposed in [26] H 

. MFOCUSS, the regularized M-FOCUSS proposed in 
[12]. In all the experiments, we set its p-norm p = 0.8, 
as suggested by the authors [3 

. SOB -MFOCUSS, a smoothness constrained M-FOCUSS 
proposed in [43]. In all the experiments, we set its p-norm 
p = 0.8. For its smoothness matrix, we chose the identity 
matrix when L < 2, and a second-order smoothness 
matrix when L > 3, as suggested by the authors. Since 
in our experiments L is small, no overlap blocks were 
usedH 

• ISL0, an improved smooth £q algorithm for the MMV 
model which was proposed in [52]. The regularization 
parameters were chosen according to the authors' sug- 
gestions 0; 

• Reweighted £\j£2, an iterative reweighted £\j£2 algo- 
rithm suggested in [25]. It is an MMV extension of the 

14 The MATLAB code was downloaded at 

|http : //dsp .ucsd. edu/ -zhilin/MSBL_code ■ zip| 

15 The MATLAB code was downloaded at 

|http: / /dsp .ucsd.edu/-zhilin/MFOCUSS.ml 

I °The MATLAB code was provided by the first author of [43] in personal 
communication. In the code the second-order smoothness matrix S was 
defined as (in MATLAB notations): S = eyc(L) - 0.25 * (diag(e(l : 
L-l),-l) + diag(e(l : L- 1), 1) + (diag(e(l : L-2), -2) + diag(e(l : 
L — 2), 2))), where e is an L X 1 vector with ones. 

17 The MATLAB code was provided by the first author of [52] in personal 
communication. 
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iterative reweighted l\ algorithm [39] via the mixed £i/£2 
norm. The algorithm is given by 

1) Set the iteration count k to zero and «j| ' = 1, i = 
l,--- ,M 

2) Solve the weighted MMV i\ minimization problem 

M 

X (fc) = axgmin^iof^Xi.Ha s.t. Y = <f>X 

i=l 

3) Update the weights for each i = 1, ■ • • , M 

w ( fe+1 ) = I 

l|Xf|| 2 + 6« 

where is adaptively adjusted as in [39]; 

4) Terminate on convergence or when k attains a 
specified maximum number of iterations fc max . Oth- 
erwise, increment k and go to Step 2). 

For noisy cases, Step 2) is modified to 

M 

X (fc) = argmin^u4 fe) ||Xi.|| 2 s.t. ||Y - *X||^ < 5 

i=l 

Throughout our experiments, A; max = 5. We implemented 
it using the CVX optimization toolbox F^i 
In noisy cases, we chose the optimal values for the regular- 
ization parameter A in MFOCUSS and the parameter i5 in 
Reweighted ti/£2 by exhaustive search. Practically, we used 
a set of candidate parameter values and for each value we ran 
an algorithm for 50 trials, and then picked up the one which 
gave the smallest averaged failure rate. By comparing enough 
number of candidate values we could ensure a nearly optimal 
value of the regularization parameter for this algorithm. For 
T-MSBL, T-SBL and MSBL, we fixed A = 1(T 9 for noiseless 
cases, and used their A learning rules for noisy cases. Besides, 
for T-MSBL we chose the learning rule (130} with 77 = 2 to 
estimate B when SNR < 15dB. 

For reproducibility, the experi- 

ment codes can be downloaded at 
|http : / / dsp . ucsd . edu/ ~zhilin/TSBL_code . zip| 

A. Benefit from Multiple Measurement Vectors at Different 
Temporal Correlation Levels 

In this experiment we study how algorithms benefit from 
multiple measurement vectors and how the benefit is dis- 
counted by the temporal correlation of sources. The dictionary 
matrix <& was of the size 25 x 125 and the number of sources 
K = 12. The number of measurement vectors L varied from 
1 to 4. No noise was added. All the sources were AR(1) pro- 
cesses with the common AR coefficient B, such that we could 
easily observe the relationship between temporal correlation 
and algorithm performance. Note that for small L, modeling 
sources as AR(1) processes, instead of AR(p) processes with 
p > 1, is sufficient to cover wide ranges of temporal structure. 
We compared algorithms at six different temporal correlation 
levels, i.e. 8 = -0.9, -0.5, 0, 0.5, 0.9, 0.99. 

18 The toolbox was downloaded at: http://cvxr.com/cvx/ 
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Figure Q] shows that with L increasing, all the algorithms 
had better performance. But as \B\ —> 1, for all the compared 
algorithms the benefit from multiple measurement vectors 
diminished. One surprising observation is that our T-MSBL 
and T-SBL had excellent performance in all cases, no matter 
what the temporal correlation was. Notice that even sources 
had no temporal correlation {8 = 0), T-MSBL and T-SBL still 
had better performance than MSBL. 

Next we compare all the algorithms in noisy environments. 
We set SNR = 25dB while kept other experimental settings 
unchanged. The behaviors of all the algorithms were similar 
to the noiseless case. To save space, we only present the cases 
of ft = 0.7 and ft = 0.9 in FigEJ 

Since the performance of all the algorithms at a given 
correlation level B is the same as their performance at the 
correlation level — 8, in the following we mainly show their 
performance at positive correlation levels. 




Number of Measurement Vectors L Number of Measurement Vectors L 



(a) P = -0.9 (b) /3 = -0.5 




Number of Measurement Vectors L Number of Measurement Vectors L 



(c) P = (d) P = 0.5 




Number of Measurement Vectors L Number of Measurement Vectors L 



(e) P = 0.9 (f) P = 0.99 

Fig. 1. Performance of all the algorithms at different temporal correlation 
levels when L varied from 1 to 4. 

B. Recovered Source Number at Different Temporal Correla- 
tion Levels 

In this experiment we study the effects of temporal cor- 
relation on the number of accurately recovered sources in 
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Number of Measurement Vectors L Number of Measurement Vectors L 



(a) = 0.7 (b) P = 0.7 




Number of Measurement Vectors L Number of Measurement Vectors L 



(c) p = 0.9 (d) P = 0.9 

Fig. 2. Performance of all the algorithms at different temporal correlation 
levels when L varied from 1 to 4 and SNR was 25 dB. 

a noiseless case. The dictionary matrix <f> was of the size 
25 x 125. L was 4. K varied from 10 to 18. The sources 
were generated in the same manner as before. Algorithms 
were compared at four different temporal correlation levels, 
i.e. P = 0, 0.5, 0.9, and 0.99. Results (FigfJ show that T- 
MSBL and T-SBL accurately recovered much more sources 
than other algorithms, especially at high temporal correlation 
levels. This indicates that our proposed algorithms are very 
advantageous in the cases when the source number is large. 

C. Ability to Handle Highly Underdetermined Problem 

Most published works only compared algorithms in mildly 
underdetermined cases, namely, the ratio of M/N was about 
2^5. However, in some applications such as neuroimaging, 
one can easily have N « 100 and M w 100000. So, in 
this experiment we compare the algorithms in the highly 
underdetermined cases when N was fixed at 25 and M/N 
varied from 1 to 25. The source number K was 12, and 
the measurement vector number L was 4. SNR was 25 
dB. Different to previous experiments, all the sources were 
AR(1) processes but with different AR coefficients. Their AR 
coefficients were uniformly chosen from (0.5, 1) at random. 
Results are presented in Figj4] from which we can see that 
when M/N > 10, all the compared algorithms had large 
errors. In contrast, our proposed algorithms had much lower 
errors. Note that due to the performance trade-off between N 
and M, if one increases N, algorithms can keep the same 
recovery performance for larger M/N . 

D. Recovery Performance for Different Kinds of Sources 

In previous experiments all the sources were AR(1) pro- 
cesses. Although we have pointed out that for small L model- 
ing sources by AR(1) processes is sufficient, here we carry 




Nonzero Source Number K Nonzero Source Number K 



(a) p = (b) P = 0.5 




Nonzero Source Number K Nonzero Source Number K 



(c) p = 0.9 (d) p = 0.99 

Fig. 3. Failure rates of all the algorithms when K varied from 10 to 18 at 
different temporal correlation levels. 




Underdetermine Degree M/N 

Fig. 4. Performance comparison in highly underdetermined cases. 

out an experiment to show our algorithms maintaining the 
same superiority for various time series. Since from previous 
experiments we have seen that T-SBL has similar performance 
to T-MSBL, and that MSBL has the best performance among 
the compared algorithms, in this experiment we only compare 
T-MSBL with MSBL. 

The dictionary matrix was of the size 25 x 125. L was 4. 
K was 14. SNR was 25dB. First we generated sources as 
three kinds of AR processes, i.e. AR(p) (p = 1,2,3). All 
the AR coefficients were randomly uniformly chosen from 
the feasible regions such that the processes were stable. We 
examined the algorithms' performance as a function of the 
AR order p. Results are given in Fig|5] showing that T-MSBL 
again outperformed MSBL. With large p, the performance 
gap between the two algorithms increased. We repeated the 



TO APPEAR IN IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL.5, NO.5, 201 1 



10 



previous experiment with the same experiment settings except 
that we replaced the AR(p) sources by moving-averaging 
sources MA(p) (p = 1,2,3). The MA coefficients were 
uniformly chosen from (0, 1] at random. Again, we obtained 
the same results. These results imply that our algorithms 
maintain their superiority for various temporally structured 
sources, not only AR processes. 




Order p Order p 



(a) (b) 

Fig. 5. Performance of T-MSBL and MSBL for different AR(p) sources and 
different MA(p) sources measured in terms of MSE and failure rates. 



E. Recovery Ability at Different Noise Levels 




SNR(dB) SNR(dB) 



(a) (b) 
Fig. 6. Performance of various algorithms at different noise levels. 

From previous experiments we have seen that the proposed 
algorithms significantly outperformed all the compared algo- 
rithms in noiseless scenarios and mildly noisy cases, even 
though to derive T-MSBL we used the approximation (|20| > 
which takes the equal sign only when B = I (no temporal 
correlation) or A = (no noise). Some natural questions may 
be raised: What is the performance of T-SBL and T-MSBL in 
strongly noisy cases? Is it still beneficial to exploit temporal 
correlation in these cases? To answer these questions, we carry 
out the following experiment. 

The dictionary matrix was of the size 25 x 125. The number 
of measurement vectors L was 4. The source number K was 
7. All the sources were AR(1) processes and the temporal 
correlation of each source was 0.8. SNR varied from 5 dB 
to 15 dB. The experiment was repeated 2000 trials. We com- 
pared the proposed T-SBL, T-MSBL with three representative 
algorithms, i.e. MSBL, MFOCUSS, and Reweighted i 1 /t 2 - 

Note that in low SNR cases, the estimated B of T-SBL 
and T-MSBL can include large errors, and thus the estimated 



amplitudes of sources are distorted. To reduce the distortion, 
we set B = I once the number of nonzero 7, was less than N 
during the learning procedure. The reason is that the role of B 
is to prevent T-SBL/T-MSBL from arriving at local minima; 
once the algorithms approach global minima very closely, B 
is no longer useful. 

Also note that the A learning rules of T-SBL, T-MSBL and 
MSBL may not lead to optimal performance in low SNR cases. 
To avoid the potential disturbance of these A learning rules, 
we provided the three SBL algorithms with the optimal A*'s, 
which were obtained by the exhaustive search method stated 
previously. 

Figure|6]shows that T-SBL and T-MSBL outperformed other 
algorithms in all the noise levels. This implies that even in 
low SNR cases exploiting temporal correlation of sources is 
beneficial. 

But we want to emphasize that although the A learning 
rules of the three SBL algorithms may not be optimal in 
low SNR cases, our proposed A learning rules can lead to 
near-optimal performance, compared to the one of MSBL. 
To see this, we ran T-MSBL and MSBL again, but this time 
both algorithms used their A learning rules. T-MSBL used the 
modified version of the A learning rule (I33t . i.e. setting the 
off-diagonal elements of <&r<I> T to zeros. The results (Fig. |6]l 
show that MSBL had very poor performance when using its 
A learning rule. In contrast, T-MSBL's performance was very 
close to its performance when using its optimal A* The 
results indicate our proposed algorithms are advantageous in 
practical applications, since in practice the optimal A*'s are 
difficult to obtain, if not impossible. 

F. Temporal Correlation: Beneficial or Detrimental? 

From previous experiments one may think that temporal 
correlation is always harmful to algorithms' performance, at 
least not helpful. However, in this experiment we will show 
that when SNR is high, the performance of our proposed 
algorithms increases with increasing temporal correlation. 

We set N = 25, L = 4, K = 14, and SNR = 50dB. The 
underdeterminacy ratio M/N varied from 5 to 20. Sources 
were generated as AR(1) processes with the common AR 
coefficient f3. We considered the performance of T-MSBL and 
MSBL in three cases, i.e. the temporal correlation /? was 0, 
0.5, and 0.9, respectively. Results are shown in FigJT] As ex- 
pected, the performance of MSBL deteriorated with increasing 
temporal correlation. But the behavior of T-MSBL was rather 
counterintuitive. It is surprising that the best performance of T- 
MSBL was not achieved at /3 = 0, but at (3 = 0.9. Clearly, high 
temporal correlation enabled T-MSBL to handle more highly 
underdetermined problems. For example, its performance at 
M/N = 20 with (3 = 0.9 was better than that at M/N = 15 
with j3 = 0.5 or (3 = 0. The same phenomenon was observed 
in noiseless cases as well, and was observed for T-SBL. 

The results indicating that temporal correlation is helpful 
may appear counterintuitive at first glance. A closer exami- 
nation of the sparse recovery problems indicates a plausible 

"T-SBL had the same behavior. But for clarity we do not present its 
performance curve. 
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M/N 

Fig. 7. Behaviors of MSBL and T-MSBL at different temporal correlation 
levels when SNR = 50dB. 

explanation. There are two elements to the sparse recovery 
task; one is the location of the nonzero entries and the other 
is the value for the nonzero entries. Both tasks interact and 
combine to determine the overall performance. Correlation 
helps the estimation of the values for the nonzero entries and 
this may be important for the problem when dealing with finite 
matrices and may be lost when dealing with limiting results as 
the matrix dimension go to infinity. A more rigorous study of 
the interplay between estimation of the values and estimation 
of the locations is an interesting topic. 

G. An Extreme Experiment on the Importance of Exploiting 
Temporal Correlation 

It may be natural to take for granted that in noiseless cases, 
when source vectors are almost identical, algorithms have 
almost the same performance as in the case when only one 
measurement vector is available. In the following we show 
that it is not the case. 

We designed a noiseless experiment. First, we generated a 
Hadamard matrix of the size 128 x 128. From the matrix, 
40 rows were randomly selected in each trial and formed a 
dictionary matrix of the size 40 x 128. The source number 
K was 12, and the measurement vector number L was 3. 
Sources were generated as AR(1) processes with the common 
AR coefficient /3, where /3 = sign(C)(l — 10~' c '). We varied 
C from -10 to 10 in order to see how algorithms behaved 
when the absolute temporal correlation, |/3|, approximated to 
1. 

Figure [8] (a) shows the performance curves of T-MSBL 
and MSBL when |/3| — > 1, and also shows the performance 
curve of MSBL when ,9 = 1. We observe an interesting 
phenomenon. First, as |/3| — > 1, MSBL's performance closely 
approximated to its performance in the case of (3 = 1. It 
seems to make sense, because when |/?| — > 1, every source 
vector provides almost the same information on locations 
and amplitudes of nonzero elements. Counter-intuitively, no 
matter how close was to 1, the performance of T-MSBL 
did not change. Figure [8] (b) shows the averaged condition 



numbers of the submatrix formed by the sources (i.e. nonzero 
rows in X gcn ) at different correlation levels. We can see that 
the condition numbers increased with the increasing temporal 
correlation. This suggests that T-MSBL was not sensitive to the 
ill-condition issue in the source matrix, while MSBL is very 
sensitive. Although not shown here, we found that T-SBL had 
the same behavior as T-MSBL, while other MMV algorithms 
had the same behaviors as MSBL. The phenomenon was 
also observed when using other dictionary matrices, such as 
random Gaussian matrices. 

These results emphasize the importance of exploiting the 
temporal correlation, and also motivate future theoretical stud- 
ies on the temporal correlation and the ill-condition issue of 
source matrices. 




-10 -8 -6 -4 -2 2 4 6 8 10 -10 -8 -6 -4 -2 2 4 6 8 10 

Correlation Index C Correlation Index C 



(a) (b) 

Fig. 8. (a) The performance and (b) the condition numbers of the submatrix 
formed by sources when the temporal correlation approximated to 1. The 
temporal correlation /3 = sign(C)(l — 10~l c ), where C was the correlation 
index varying from -10 to 10. 



VII. Discussions 

Although there are a few works trying to exploit temporal 
correlation in the MMV model, based on our knowledge 
no works have explicitly studied the effects of temporal 
correlation, and no existing algorithms are effective in the 
presence of such correlation. Our work is a starting point in 
the direction of considering temporal correlation in the MMV 
model. However, there are many issues that are unclear so far. 
In this section we discuss some of them. 

A. The Matrix B: Trade-off Between Accurately Modeling and 
Preventing Overfitting 

In our algorithm development we used one single matrix B 
as the covariance matrix (up to a scalar) for each source model 
in order to avoid overfitting. Mathematically, it is straight- 
forward to extend our algorithms to use multiple matrices to 
capture the covariance structures of sources. For example, one 
can classify sources into several groups, say G groups, and 
the sources in a group are all assigned by a common matrix 
Bj (i = 1, • • • , G, G -C M) as the covariance matrix (up to 
a scalar). It seems that this extension can better capture the 
covariance structures of sources while still avoiding overfitting. 
However, we find that this extension (even for G = 2) has 
much poorer performance than our proposed algorithms and 
MSBL. One possible reason is that during the early stage of 
the learning procedure of our algorithms, the estimated sources 
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from each iteration are far from the true sources, and thus 
grouping them based on their covariance structures is difficult, 
if not impossible. The grouping error may cause avalanche 
effect, leading to the noted poor performance. Reducing the 
grouping error and more accurately capturing the temporal 
correlation structures is an area for future work. 

However, as we have stated, B plays a role of whitening 
each source. In our recent work [32], [33] we found that the 
operation Xj.B _1 X|^ (Vi) can replace the row-norms (such 
as the £2 norm and the £oo norm) in iterative reweighted 
£2 and £\ algorithms for the MMV model, functioning as 
a row regularization. This indicates that using one single 
matrix B may be a better method than using multiple matrices 
Bi, • ,B G . 

On the other hand, there may be many ways to parameterize 
and estimate B. In this work we provide a general method to 
estimate B. In [31] we proposed a method to parameterize B 
by a hyperparameter j3, i.e., 



B 



P 
1 



P 



L-l 
?L-2 



ft 



L-l 



P 



L-2 



1 



which equivalently assumes the sources are AR(1) processes 
with the common AR coefficient (3. The resulting algorithms 
have good performance as well. Also, for low SNR cases in our 
experiments, we added an identity matrix (with a scalar) to the 
estimated B in T-MSBL, and achieved satisfying performance. 
All these imply that B could have many forms. Finding the 
forms that are advantageous in strongly noisy environments is 
an important issue and needs further study. 



B. The Parameter A: Noise Variance or Regularization Pa- 
rameter? 

In our algorithms the covariance matrix of the multi-channel 
noise V., (i = 1, • • ■ , L) is AIjv with the implicit assumption 
that each channel noise has the same variance A. It is straight- 
forward to extend our algorithms to consider the general noise 
covariance matrix diag([Ai, • • • , Ajv]), i.e. assuming different 
channel noise have different variance. However, this largely 
increases parameters to estimate, and thus we may once again 
encounter an overfitting problem (similar to the overfitting 
problem in learning the matrix B;). 

Some works [34], [53] considered alternative noise covari- 
ance models. In [34] the authors assumed that the covariance 
matrix of multi-channel noise is AC, instead of AIjv, where 
C is a known positive definite and symmetric matrix and 
A is an unknown noise-variance parameter. This model may 
better capture the noise covariance structures, but generally 
one does not know the exact value of C. Thus there is no 
clear benefit from this covariance model. In [53], instead of 
deriving a learning rule for the noise covariance inside the 
SBL framework, the authors estimated the noise covariance by 
a method independent of the SBL framework. But this method 
is based on a large number of measurement vectors, and has 
a high computational load. 



On the other hand, due to the works in [25], [27], [54], 
which connected SBL algorithms to traditional convex relax- 
ation methods such as Lasso [37] and Basis Pursuit Denoising 
[38], it was found that A is functionally the same as the reg- 
ularization parameters of those convex relaxation algorithms. 
This suggests the use of methods such as the modified L-curve 
procedure [55] or the cross-validation [37], [38] to choose A 
especially in strongly noisy environments. It is also interesting 
to see that SBL algorithms could adopt the continuation 
strategies [56], [57], used in Lasso-type algorithms, to adjust 
the value of A for better recovery performance or faster speed. 

However, if some channels contain very large noise (e.g. 
outliers) and the number of such channels is very small, then 
as suggested in [58], we can extend the dictionary matrix <I> to 
[<&, I] and perform any sparse signal recovery algorithms with- 
out modification. The estimated 'sources' associated with the 
identity dictionary matrix are these large noise components. 

C. Connections to Other Models 

In fact, our bSBL framework is a block sparsity model 
[13], [22], [41], and thus the derived T-SBL algorithm can 
be directly used for this model. Compared to most existing 
algorithms derived in this model [22], [41], [59], an important 
difference is that T-SBL considers the correlation within each 
block. 

The time-varying sparsity model [60], [61] is another re- 
lated model. Different to our MMV model that assumes the 
support of each source vector is the same, the time-varying 
sparsity model assumes the support is slowly time-varying. It 
is interesting to note that this model can be approximated by 
concatenation of several MMV models, where in each MMV 
model the support does not change. Thus our proposed T- 
SBL and T-MSBL can be used for this model. The results are 
appealing, as shown in our recent work [33]. 

It should be noted that the proposed algorithms can be 
directly used for the SMV model. In this case the matrix B 
reduces to a scalar, and the 7; learning rules are the same 
as the one in the basic SBL algorithm [30]. But due to the 
effective A learning rules, our algorithms are superior to the 
basic SBL algorithm, especially in noisy cases. 

VIII. Conclusions 

We addressed a multiple measurement vector (MMV) model 
in practical scenarios, where the source vectors are temporally 
correlated and the number of measurement vectors is small due 
to the common sparsity constraint. We showed that existing 
algorithms have poor performance when temporal correlation 
is present, and thus they have limited ability in practice. To 
solve this problem, we proposed a block sparse Bayesian 
learning framework, which allows for easily modeling the 
temporal correlation and incorporating this information into 
derived algorithms. Based on this framework, we derived two 
algorithms, namely, T-SBL and T-MSBL. The latter can be 
seen as an extension of MSBL by replacing the £2 norm 
imposed on each source with a Mahalanobis distance mea- 
sure. Extensive experiments have shown that the proposed 
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algorithms have superior performance to many state-of-the- 
art algorithms. Theoretical analysis also has shown that the 
proposed algorithms have desirable global and local minimum 
properties. 
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Appendix 
A. Outline of the Proof of Theorem Q] 

Since the proof is a generalization of the Theorem 1 in [53], 
we only give an outline. 

For convenience we consider the equivalent model (|3}- Let 
x be computed using x = (ASq 1 +D T D) _1 D T y with So = 
diag{7iBi, ■ • • ,7a/B m }, and 7 = [71, •• ■ ,7a/] ^obtained 
by globally minimizing the cost function for given (Vi) 

£( 7j )=y T S^ 1 y + log|S y |. 

It can be shown [53] that when A — > (noiseless case), the 
above problem is equivalent to 

min : g(x) = min ^Ep^x + log IXLll (37) 

•7 

s.t. : y = Dx (38) 

So we only need to show the global minimizer of d37l i satisfies 
the property stated in the theorem. 

Assume in the noiseless problem Y = <I>X, <I> satisfies the 
URP condition [5]. For its any solution X, denote the number 
of nonzero rows by K. Thus following the method in [53], 
we can show the above g(x) satisfies 

gr(x) = 0(1)+ (NL - min[JVi, KL\) log A, (39) 

providing B^ is full rank. Here we adopt the notation f(s) = 
O(l) to indicate that |/(s)| < C\ for all s < C2, with C\ 
and C2 constants independent of s. Therefore, by globally 
minimizing d39l ). i.e. globally minimizing d37l i. K will achieve 
its minimum value, which will be shown to be Kq, the number 
of nonzero rows in X gcn . 

According to the result in [6], [12], if X gon satisfies 

v , N + L 
A <^ 

then there is no other solution (with K nonzero rows) such that 
Y = <&X with K < So, K > Kq, i.e. the minimum 

value of K is Kq. Once K achieves its minimum, we have 
X = X gcn . 

20 In the proof we fix because we will see has no effect on the 
global minimum property. 



In summary, the global minimum solution 7 leads to the 
solution that equals to the unique sparsest solution X gcn . And 
we can see, providing B; is full rank, it does not affect the 
conclusion. 

B. Proof of Lemma [2] 

Re-write the equation y T S,T x y = C by y T u = C, where 
u = E^y = (AI + DS D T ) y, from which we have 
y - Au = DS D T u = D(T ® B)D T u = D(I M ® B)(T <g> 
I L )D T u = D(I M $ B)diag(D T u)diag(r <8 I L ) = (* ® 
B)diag(D T u)(7 ® 1^). It can be seen that the matrix A = 
(<& ® B)diag(D T u) is full row rank. 

C. Proof of Theorem [2] 

The proof follows along the lines of Theorem 2 in [30] 
using our Lemma Q] and Lemma [2] Consider the optimization 
problem: 

min: /( 7 ) 4 log |AI + DS D T | 
s.t.: A-(7®l L ) = b (40) 
7^0 

where A and b are defined in Lemma |2] From Lemma Q] 
and Lemma |2] we can see the optimization problem (l40b 
is optimizing a concave function over a closed, bounded 
convex polytope. Obviously, any local minimum of C, e.g. 
7*, must also be a local minimum of the above optimization 
problem with C = y T (AI + D(r* ® B)D T )~ 1 y> where 
T* = diag(7*). Based on the Theorem 6.5.3 in [50] the 
minimum of (l40l > is achieved at an extreme point. Further, 
based on the Theorem in Chapter 2.5 of [50] the extreme 
point is a BFS to 

f A • (7® 1 L ) = b 

which indicates H7H0 < NL. 

D. Proof of Lemma \3\ 

For convenience we first consider the case of K = N. Let 
7 be the vector consisting of nonzero elements in 7, and $ be 
a matrix consisting of the columns of # whose indexes are the 
same as those of nonzero elements in 7. Thus, the equation 
Y = <&X can be rewritten as Y = 4>X. By transferring it 
to its equivalent block sparse Bayesian learning model, we 
have y = Dx, where y = vec(Y T ), D = $ (g) Ij,, and 
x = vec(X T ). Since D is a square matrix with full rank, we 
have x = D _1 y. For convenience, let = 3cru_ 1 \ L+1 . iL ], i.e. 
Xj consists of elements of x with indexes from (i — 1)L + 1 
to iL. Now consider the cost function C, which becomes 

£(7) = H - +^log7i) +Mlog|B| 

4=1 " 

+21og|D|. 
Letting = gives 

1% = -jrx[ B" 1 ^, i = !,•■•, K 
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The second derivative of C at 7s = -^x^B 1 x i is given by 



a 2 £( 7 ) 



B-'x, 



> 0. So 



Since B is positive definite and Xj ^ 0, 

7i = ixf B" 1 ^ (i = 1, • • ■ , A") is a local minimum. 

If ||7|[o = K < N, which implies there exists xje R KLxl 
such that y = Dx, then we can expand the matrix D to a full- 
rank square matrix [D, D e ] by adding an arbitrary full column- 
rank matrix D R . And we expand x to [x T ,e T l T , where 



— > 



eg R (N-K)Lxi and £ ^ o Therefore, [D, D e ][x T , e T ] T 
Dx = y. Similarly, we also expand 7 to [7 , £ T ] T with 
£ — > 0. Then, following the above steps, we can obtain the 
same result. Therefore, we finish the proof. 
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